An Update on the Favorite Movies

Addressing Exploratory Questions developed by the 431 Class

Author

Thomas E. Love, Ph.D.

Modified

2025-10-04

1 Questions This Work Addresses

1.1 Data Management Questions

  • How do we ingest data from a Google Sheet?
  • How do we clean up some of the favorite movies data?
  • How might we deal with a “check all that apply” item like imdb_genres from our favorite movies?

1.2 Analytic Questions

  1. Do movies released in 1940-2005 have more user ratings than movies released after 2005?
  2. Which MPA categories have higher average IMDB star ratings?
  3. How strong is the association between the number of IMDB user ratings (imdb_ratings) and the weighted average star rating (imdb_stars)?
  4. Which movie genres have the highest weighted average star ratings on IMDB?

2 R Setup

knitr::opts_chunk$set(comment = NA)

library(janitor)
library(naniar)

library(googlesheets4) ## to read in data from shared Google Sheet
library(knitr); library(kableExtra) ## to make prettier tables
library(rstanarm) ## to fit Bayesian linear models
library(MKinfer)  ## for bootstrap t tests
library(infer)  ## for bootstrapping differences in medians
library(DescTools) ## for post-hoc testing in ANOVA
library(car) ## for boxCox
library(xfun)  ## for session information

library(easystats)
library(tidyverse)

theme_set(theme_bw())

source("Love-431.R") ## for lovedist() function

3 Ingesting the Data

The Google Sheet containing the movies_2025-09-30 data is found at the following URL:

https://docs.google.com/spreadsheets/d/1CnbaDAeFoTzvI_V7b3-C0Gy6c9sdNsFWztACLJ4PMAw/edit?usp=drive_link

gs4_deauth()

url_movies <- "https://docs.google.com/spreadsheets/d/1CnbaDAeFoTzvI_V7b3-C0Gy6c9sdNsFWztACLJ4PMAw/edit?usp=drive_link"

movies_raw <- read_sheet(url_movies) |> 
  janitor::clean_names()
✔ Reading from "movies_2025-09-30".
✔ Range 'movies_2025-09-30'.
dim(movies_raw)
[1] 260  11

This initial result should (and does) include 260 movies (rows) and 11 variables (columns).

3.1 Data Cleaning

3.1.1 Are there any missing values?

n_miss(movies_raw) ## any missing values?
[1] 0
Note

If we had any missing values indicated above, we’d probably want to run miss_var_summary() and miss_case_table() to understand the situation better, but that’s not necessary here.
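Under the hood, n_miss() is just a count of NA cells; a quick base R equivalent (on toy data, since our movies have no missing values) looks like this:

```r
## n_miss() counts NA cells; base R does the same with sum(is.na(...)).
## Toy data with two missing cells:
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
sum(is.na(df))      # total missing cells: 2
colSums(is.na(df))  # per-variable counts, in the spirit of miss_var_summary()
```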

3.1.2 Restricting to Key Variables and Minor Changes

Here’s a glimpse at our current tibble:

glimpse(movies_raw)
Rows: 260
Columns: 11
$ mov_id        <chr> "M-001", "M-002", "M-003", "M-004", "M-005", "M-006", "M…
$ movie         <list> "3 Idiots", "8 1/2", "10 Things I Hate About You", "20t…
$ year          <dbl> 2009, 1963, 1999, 2016, 2006, 1968, 2009, 2013, 1988, 19…
$ length        <dbl> 170, 138, 97, 119, 117, 149, 119, 123, 124, 117, 160, 16…
$ imdb_genres   <chr> "Comedy, Drama", "Drama", "Comedy, Drama, Romance", "Com…
$ imdb_ratings  <dbl> 467000, 130000, 430000, 52000, 898000, 764000, 60000, 42…
$ imdb_stars    <dbl> 8.4, 8.0, 7.4, 7.3, 7.6, 8.3, 7.9, 7.8, 8.0, 8.5, 8.4, 7…
$ mpa           <chr> "PG-13", "Not Rated", "PG-13", "R", "R", "G", "TV-PG", "…
$ imdb_synopsis <chr> "Two friends are searching for their long lost companion…
$ list_25       <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,…
$ imdb_link     <chr> "https://www.imdb.com/title/tt1187043", "https://www.imd…

We’re going to focus on 6 variables in our analyses today, plus the two identifying variables (mov_id and movie). They are year, length, imdb_genres, imdb_ratings, imdb_stars, and mpa. We’ll also:

  • rearrange the data a little,
  • ignore the variables we’re not using today,
  • create ratings100K, the number of user ratings divided by 100,000, so that it is scaled more similarly to our other variables, and
  • represent the movie names as a character variable, rather than as a list, which is what R defaulted to when ingesting the data.
movies_1 <- movies_raw |>
  select(mov_id, year, length, imdb_ratings, imdb_stars, mpa, 
         imdb_genres, movie) |>
  mutate(ratings100K = imdb_ratings/100000) |>
  mutate(movie = as.character(movie))

glimpse(movies_1)
Rows: 260
Columns: 9
$ mov_id       <chr> "M-001", "M-002", "M-003", "M-004", "M-005", "M-006", "M-…
$ year         <dbl> 2009, 1963, 1999, 2016, 2006, 1968, 2009, 2013, 1988, 197…
$ length       <dbl> 170, 138, 97, 119, 117, 149, 119, 123, 124, 117, 160, 162…
$ imdb_ratings <dbl> 467000, 130000, 430000, 52000, 898000, 764000, 60000, 420…
$ imdb_stars   <dbl> 8.4, 8.0, 7.4, 7.3, 7.6, 8.3, 7.9, 7.8, 8.0, 8.5, 8.4, 7.…
$ mpa          <chr> "PG-13", "Not Rated", "PG-13", "R", "R", "G", "TV-PG", "R…
$ imdb_genres  <chr> "Comedy, Drama", "Drama", "Comedy, Drama, Romance", "Come…
$ movie        <chr> "3 Idiots", "8 1/2", "10 Things I Hate About You", "20th …
$ ratings100K  <dbl> 4.670, 1.300, 4.300, 0.520, 8.980, 7.640, 0.600, 4.200, 2…

3.1.3 Are the identifying variables distinct?

We want to make sure we have a different value of mov_id and movie for each row of our data.

nrow(movies_1)
[1] 260
n_distinct(movies_1$mov_id)
[1] 260
n_distinct(movies_1$movie)
[1] 260

OK. Each of these counts is 260, which is comforting.
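If the counts had disagreed, base R's duplicated() would list the repeated values; a toy sketch:

```r
## duplicated() flags second and later occurrences of a value, so
## subsetting with it lists any repeated titles (toy data):
movie <- c("Alien", "Coco", "Alien")
length(unique(movie))     # 2 distinct titles, not 3
movie[duplicated(movie)]  # the offending repeat: "Alien"
```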

3.1.4 Check Ranges of Quantities

We have four quantitative variables here, so let’s check the range of their values.

movies_1 |> reframe(range(imdb_stars), range(ratings100K), 
                range(length), range(year))
# A tibble: 2 × 4
  `range(imdb_stars)` `range(ratings100K)` `range(length)` `range(year)`
                <dbl>                <dbl>           <dbl>         <dbl>
1                 3.4                0.013              70          1940
2                 9.3               31                 207          2024
  • imdb_stars (weighted average star rating) must be between 1 and 10 stars, and the range we see is (3.4, 9.3). That seems reasonable.
  • ratings100K ranges from 0.013 (or 1,300 users) to 31 (or 3,100,000 users) who’ve rated the movie. Fine.
  • length of the movie ranges from 70 to 207 minutes. Seems OK.
  • year movie was released covers the period from 1940 to 2024. Also seems reasonable.

3.1.5 mpa is categorical

Let’s assess the levels of mpa rating, and consider how to create a factor representation of that information in R.

movies_1 |> count(mpa)
# A tibble: 10 × 2
   mpa           n
   <chr>     <int>
 1 Approved      2
 2 G             8
 3 Not Rated    14
 4 PG           67
 5 PG-13        84
 6 R            80
 7 TV-14         1
 8 TV-G          2
 9 TV-MA         1
10 TV-PG         1

Since only PG, PG-13 and R have fairly large sample sizes, let’s collapse all of the other categories into a single level, creating a four-level factor to represent mpa rating.

Fortunately, the forcats package (part of the tidyverse) has a tool for this task.

movies_1 <- movies_1 |>
  mutate(mpa4 = fct_lump_n(mpa, n = 3, other_level = "Other"))
Note

fct_lump_n() with n = 3 collapses all mpa values that occur less often than the top 3 mpa values.
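In base R terms, the lumping step amounts to keeping the three most frequent labels and relabeling the rest; a rough sketch on toy ratings:

```r
## What fct_lump_n(mpa, n = 3, other_level = "Other") does, roughly
## (toy ratings; fct_lump_n also has specific tie-handling rules):
mpa <- c("PG", "PG", "PG-13", "PG-13", "PG-13", "R", "R", "G", "TV-14")
top3 <- names(sort(table(mpa), decreasing = TRUE))[1:3]
mpa4 <- ifelse(mpa %in% top3, mpa, "Other")
table(mpa4)   # G and TV-14 collapse into "Other"
```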

Let’s do a little sanity check to ensure that the relationship between mpa and mpa4 matches what we expect.

movies_1 |> tabyl(mpa, mpa4) |> 
  adorn_totals(where = "row") |> adorn_title(placement = "combined")
  mpa/mpa4 PG PG-13  R Other
  Approved  0     0  0     2
         G  0     0  0     8
 Not Rated  0     0  0    14
        PG 67     0  0     0
     PG-13  0    84  0     0
         R  0     0 80     0
     TV-14  0     0  0     1
      TV-G  0     0  0     2
     TV-MA  0     0  0     1
     TV-PG  0     0  0     1
     Total 67    84 80    29

Looks good.

3.1.6 Dealing with imdb_genres: A bigger task

A genre is a category of a work of art defined by a particular set of shared conventions, content, form or style. These categories help organize and classify works based on common characteristics, allowing audiences and critics to understand what to expect from a piece.

Note

As we discussed in Class 09, each value in imdb_genres lists between 1 and 8 of the following 20 types of movie.

  • Action, Adventure, Animation, Biography, Comedy, Crime, Drama, Family, Fantasy, History,
  • Horror, Mystery, Music, Musical, Romance, Sci-Fi, Sport, Thriller, War, Western
movies_1 |> count(imdb_genres) |> arrange(desc(n))
# A tibble: 145 × 2
   imdb_genres                             n
   <chr>                               <int>
 1 Drama, Romance                         13
 2 Drama                                  12
 3 Comedy, Drama                          11
 4 Comedy, Drama, Romance                 11
 5 Comedy                                  8
 6 Action, Adventure, Fantasy, Sci-Fi      7
 7 Action, Adventure, Sci-Fi               7
 8 Action, Adventure, Thriller             5
 9 Action, Adventure, Sci-Fi, Thriller     4
10 Crime, Drama, Thriller                  4
# ℹ 135 more rows

In 260 movies, we see 145 different combinations of genres in the imdb_genres variable. As the table shows, the most common combination is Drama and Romance (13 movies), while, for example, 12 movies list only Drama.

3.1.7 Building Indicators for Genres

To represent these data in a more useful way, I suggest building a set of 20 indicator variables (which take the value 1 or 0) to indicate the presence (1) or absence (0) of each of the 20 available genres for that movie.

One way to do this (admittedly, a tedious bit of coding) is to use the str_detect() function from the stringr package (part of the tidyverse) as follows:

movies_1 <- movies_1 |>
  mutate(action = as.numeric(str_detect(imdb_genres, fixed("Action"))),
         adventure = as.numeric(str_detect(imdb_genres, fixed("Adventure"))),
         animation = as.numeric(str_detect(imdb_genres, fixed("Animation"))),
         biography = as.numeric(str_detect(imdb_genres, fixed("Biography"))),
         comedy = as.numeric(str_detect(imdb_genres, fixed("Comedy"))),
         crime = as.numeric(str_detect(imdb_genres, fixed("Crime"))),
         drama = as.numeric(str_detect(imdb_genres, fixed("Drama"))),
         family = as.numeric(str_detect(imdb_genres, fixed("Family"))),
         fantasy = as.numeric(str_detect(imdb_genres, fixed("Fantasy"))),
         history = as.numeric(str_detect(imdb_genres, fixed("History"))),
         horror = as.numeric(str_detect(imdb_genres, fixed("Horror"))),
         mystery = as.numeric(str_detect(imdb_genres, fixed("Mystery"))),
         music = as.numeric(str_detect(imdb_genres, fixed("Music"))), ## note: fixed("Music") also matches "Musical"
         musical = as.numeric(str_detect(imdb_genres, fixed("Musical"))),
         romance = as.numeric(str_detect(imdb_genres, fixed("Romance"))),
         scifi = as.numeric(str_detect(imdb_genres, fixed("Sci-Fi"))),
         sport = as.numeric(str_detect(imdb_genres, fixed("Sport"))),
         thriller = as.numeric(str_detect(imdb_genres, fixed("Thriller"))),
         war = as.numeric(str_detect(imdb_genres, fixed("War"))),
         western = as.numeric(str_detect(imdb_genres, fixed("Western"))),
  )
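The same indicators can be built less repetitively by looping over a vector of genre labels. Here is a base R sketch using grepl(), the base analogue of str_detect(), on a couple of toy genre strings:

```r
## Build all 20 indicator columns in one pass with sapply()/grepl()
## (toy strings; with the real data, x would be movies_1$imdb_genres).
## Caveat: fixed substring matching means "Music" also hits "Musical".
genres <- c("Action", "Adventure", "Animation", "Biography", "Comedy",
            "Crime", "Drama", "Family", "Fantasy", "History", "Horror",
            "Mystery", "Music", "Musical", "Romance", "Sci-Fi", "Sport",
            "Thriller", "War", "Western")
x <- c("Comedy, Drama", "Drama", "Action, Adventure, Sci-Fi")
ind <- sapply(genres, function(g) as.numeric(grepl(g, x, fixed = TRUE)))
colSums(ind)[c("Drama", "Comedy", "Sci-Fi")]   # 2, 1, 1
```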

Let’s use these new variables to obtain counts of the number of films associated with each genre. Since each value is either 1 (if that genre is in the movie’s list) or 0, we can sum up the columns to get these counts. (Note that select(action:war) covers 19 of the 20 indicators; western, which describes only two movies, appears in the genre_counts tibble below.)

movies_1 |> select(action:war) |> colSums()
   action adventure animation biography    comedy     crime     drama    family 
       60        87        29        17       103        34       153        44 
  fantasy   history    horror   mystery     music   musical   romance     scifi 
       60         6        12        26        30        17        62        51 
    sport  thriller       war 
        6        50        11 

Now, we can answer questions like:

How many movies are categorized as both Drama and Mystery?

movies_1 |> count(drama, mystery)
# A tibble: 4 × 3
  drama mystery     n
  <dbl>   <dbl> <int>
1     0       0    95
2     0       1    12
3     1       0   139
4     1       1    14

Can we list the 14 movies that are categorized as both Drama and Mystery (and perhaps other things)?

movies_1 |> filter(drama == 1, mystery == 1) |> 
  select(movie, imdb_stars, imdb_genres) 
# A tibble: 14 × 3
   movie                           imdb_stars imdb_genres                       
   <chr>                                <dbl> <chr>                             
 1 About Elly (Darbareye Elly)            7.9 Drama, Mystery                    
 2 Blade Runner 2049                      8   Action, Drama, Mystery, Sci-Fi, T…
 3 Blue Velvet                            7.7 Crime, Drama, Mystery, Thriller   
 4 Cloud Atlas                            7.4 Drama, Mystery, Sci-Fi, Thriller  
 5 Coco                                   8.4 Animation, Adventure, Drama, Fami…
 6 The Conversation                       7.7 Drama, Mystery, Thriller          
 7 Divergent                              6.6 Action, Adventure, Drama, Mystery…
 8 The Girl with the Dragon Tattoo        7.8 Crime, Drama, Mystery, Thriller   
 9 The Green Mile                         8.6 Crime, Drama, Fantasy, Mystery    
10 The Hateful Eight                      7.8 Crime, Drama, Mystery, Thriller, …
11 Hereditary                             7.3 Drama, Horror, Mystery, Thriller  
12 Memento                                8.4 Drama, Mystery, Thriller          
13 The Prestige                           8.5 Drama, Mystery, Sci-Fi, Thriller  
14 The Rite                               6   Drama, Horror, Mystery, Thriller  

Are there any movies that include Comedy, Romance and Thriller?

movies_1 |> count(comedy, romance, thriller)
# A tibble: 8 × 4
  comedy romance thriller     n
   <dbl>   <dbl>    <dbl> <int>
1      0       0        0    84
2      0       0        1    45
3      0       1        0    26
4      0       1        1     2
5      1       0        0    67
6      1       0        1     2
7      1       1        0    33
8      1       1        1     1
movies_1 |> filter(comedy == 1, romance == 1, thriller == 1) |> 
  select(movie, imdb_stars, imdb_genres) 
# A tibble: 1 × 3
  movie                            imdb_stars imdb_genres                       
  <chr>                                 <dbl> <chr>                             
1 Om Shanti Om (Peace Be With You)        6.8 Action, Comedy, Drama, Fantasy, M…
Note

Suppose we wanted to build a tibble of the genre counts. Consider the following approach.

genre_counts <- movies_1 |> 
  select(action:western) |>
  colSums() |> 
  t() |> as_tibble() |> pivot_longer(action:western) |>
  rename(genre = name, movies = value) |> 
  arrange(desc(movies))

genre_counts
# A tibble: 20 × 2
   genre     movies
   <chr>      <dbl>
 1 drama        153
 2 comedy       103
 3 adventure     87
 4 romance       62
 5 action        60
 6 fantasy       60
 7 scifi         51
 8 thriller      50
 9 family        44
10 crime         34
11 music         30
12 animation     29
13 mystery       26
14 biography     17
15 musical       17
16 horror        12
17 war           11
18 history        6
19 sport          6
20 western        2
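The colSums-to-tibble step in the note can also be done without the t() trick; in base R terms it amounts to the following (toy counts shown):

```r
## Turning a named vector of counts into a two-column table, sorted
## (toy counts; the real version starts from colSums() over movies_1).
counts <- c(drama = 153, comedy = 103, western = 2)
df <- data.frame(genre = names(counts), movies = unname(counts))
df[order(-df$movies), ]   # descending by number of movies
```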

3.2 Numerical Summaries

3.2.1 data_codebook results

Here’s a brief description of each of the variables that we’re now thinking about using in an analysis, leaving out the identifying variables, and the original versions of the imdb_ratings, mpa and imdb_genres variables.

data_codebook(movies_1 |> 
                select(-mov_id, -movie, -imdb_ratings, -mpa, -imdb_genres))
select(movies_1, -mov_id, -movie, -imdb_ratings, -mpa, -imdb_genres) (260 rows and 25 variables, 25 shown)

ID | Name        | Type        | Missings |       Values |           N
---+-------------+-------------+----------+--------------+------------
1  | year        | numeric     | 0 (0.0%) | [1940, 2024] |         260
---+-------------+-------------+----------+--------------+------------
2  | length      | numeric     | 0 (0.0%) |    [70, 207] |         260
---+-------------+-------------+----------+--------------+------------
3  | imdb_stars  | numeric     | 0 (0.0%) |   [3.4, 9.3] |         260
---+-------------+-------------+----------+--------------+------------
4  | ratings100K | numeric     | 0 (0.0%) |   [0.01, 31] |         260
---+-------------+-------------+----------+--------------+------------
5  | mpa4        | categorical | 0 (0.0%) |           PG |  67 (25.8%)
   |             |             |          |        PG-13 |  84 (32.3%)
   |             |             |          |            R |  80 (30.8%)
   |             |             |          |        Other |  29 (11.2%)
---+-------------+-------------+----------+--------------+------------
6  | action      | numeric     | 0 (0.0%) |            0 | 200 (76.9%)
   |             |             |          |            1 |  60 (23.1%)
---+-------------+-------------+----------+--------------+------------
7  | adventure   | numeric     | 0 (0.0%) |            0 | 173 (66.5%)
   |             |             |          |            1 |  87 (33.5%)
---+-------------+-------------+----------+--------------+------------
8  | animation   | numeric     | 0 (0.0%) |            0 | 231 (88.8%)
   |             |             |          |            1 |  29 (11.2%)
---+-------------+-------------+----------+--------------+------------
9  | biography   | numeric     | 0 (0.0%) |            0 | 243 (93.5%)
   |             |             |          |            1 |  17 ( 6.5%)
---+-------------+-------------+----------+--------------+------------
10 | comedy      | numeric     | 0 (0.0%) |            0 | 157 (60.4%)
   |             |             |          |            1 | 103 (39.6%)
---+-------------+-------------+----------+--------------+------------
11 | crime       | numeric     | 0 (0.0%) |            0 | 226 (86.9%)
   |             |             |          |            1 |  34 (13.1%)
---+-------------+-------------+----------+--------------+------------
12 | drama       | numeric     | 0 (0.0%) |            0 | 107 (41.2%)
   |             |             |          |            1 | 153 (58.8%)
---+-------------+-------------+----------+--------------+------------
13 | family      | numeric     | 0 (0.0%) |            0 | 216 (83.1%)
   |             |             |          |            1 |  44 (16.9%)
---+-------------+-------------+----------+--------------+------------
14 | fantasy     | numeric     | 0 (0.0%) |            0 | 200 (76.9%)
   |             |             |          |            1 |  60 (23.1%)
---+-------------+-------------+----------+--------------+------------
15 | history     | numeric     | 0 (0.0%) |            0 | 254 (97.7%)
   |             |             |          |            1 |   6 ( 2.3%)
---+-------------+-------------+----------+--------------+------------
16 | horror      | numeric     | 0 (0.0%) |            0 | 248 (95.4%)
   |             |             |          |            1 |  12 ( 4.6%)
---+-------------+-------------+----------+--------------+------------
17 | mystery     | numeric     | 0 (0.0%) |            0 | 234 (90.0%)
   |             |             |          |            1 |  26 (10.0%)
---+-------------+-------------+----------+--------------+------------
18 | music       | numeric     | 0 (0.0%) |            0 | 230 (88.5%)
   |             |             |          |            1 |  30 (11.5%)
---+-------------+-------------+----------+--------------+------------
19 | musical     | numeric     | 0 (0.0%) |            0 | 243 (93.5%)
   |             |             |          |            1 |  17 ( 6.5%)
---+-------------+-------------+----------+--------------+------------
20 | romance     | numeric     | 0 (0.0%) |            0 | 198 (76.2%)
   |             |             |          |            1 |  62 (23.8%)
---+-------------+-------------+----------+--------------+------------
21 | scifi       | numeric     | 0 (0.0%) |            0 | 209 (80.4%)
   |             |             |          |            1 |  51 (19.6%)
---+-------------+-------------+----------+--------------+------------
22 | sport       | numeric     | 0 (0.0%) |            0 | 254 (97.7%)
   |             |             |          |            1 |   6 ( 2.3%)
---+-------------+-------------+----------+--------------+------------
23 | thriller    | numeric     | 0 (0.0%) |            0 | 210 (80.8%)
   |             |             |          |            1 |  50 (19.2%)
---+-------------+-------------+----------+--------------+------------
24 | war         | numeric     | 0 (0.0%) |            0 | 249 (95.8%)
   |             |             |          |            1 |  11 ( 4.2%)
---+-------------+-------------+----------+--------------+------------
25 | western     | numeric     | 0 (0.0%) |            0 | 258 (99.2%)
   |             |             |          |            1 |   2 ( 0.8%)
----------------------------------------------------------------------

3.2.2 lovedist results on quantities

Let’s look at some general numerical summaries for the four quantitative variables (not including the binary indicator variables) we are analyzing.

dat1 <- bind_rows(
  movies_1 |> reframe(lovedist(imdb_stars)),
  movies_1 |> reframe(lovedist(ratings100K)),
  movies_1 |> reframe(lovedist(length)),
  movies_1 |> reframe(lovedist(year))
)

dat1 <- dat1 |> 
  mutate(var_name = c("imdb_stars", "ratings100K", "length", "year")) |>
  relocate(var_name)

kable(dat1, digits = 1) 
var_name      n miss   mean   sd    med  mad    min    q25    q75    max
imdb_stars  260    0    7.6  0.9    7.7  0.7    3.4    7.1    8.1    9.3
ratings100K 260    0    5.9  5.9    3.7  4.0    0.0    1.7    8.4   31.0
length      260    0  122.8 24.9  118.0 23.7   70.0  103.8  136.2  207.0
year        260    0 2002.7 14.8 2005.5 12.6 1940.0 1996.0 2013.0 2024.0
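lovedist() itself is sourced from Love-431.R and isn’t shown in this document; judging from the columns it reports, a base-R stand-in (an assumption, not the real function) might look like this:

```r
## A guess at lovedist()'s shape, based only on the columns shown above
## (the real function lives in Love-431.R):
lovedist_sketch <- function(x) {
  data.frame(n = length(x), miss = sum(is.na(x)),
             mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE),
             med = median(x, na.rm = TRUE), mad = mad(x, na.rm = TRUE),
             min = min(x, na.rm = TRUE),
             q25 = unname(quantile(x, 0.25, na.rm = TRUE)),
             q75 = unname(quantile(x, 0.75, na.rm = TRUE)),
             max = max(x, na.rm = TRUE))
}
lovedist_sketch(c(3.4, 7.1, 7.7, 8.1, 9.3))
```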

4 Question 1

  • Do movies released in 1940-2005 have more user ratings than movies released after 2005?

Let’s build an appropriate factor called release to identify the two groups of movies being compared here, then obtain some numeric summaries of ratings100K (hundreds of thousands of user ratings) in each release group.

movies_1 <- movies_1 |>
  mutate(release = fct_recode(factor(year > 2005),
                            After2005 = "TRUE", Older = "FALSE"))
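In base R terms, the recode above amounts to the following (toy years shown):

```r
## Building the two-level release factor without forcats (toy years):
year <- c(1999, 2009, 2005, 2016)
release <- factor(ifelse(year > 2005, "After2005", "Older"),
                  levels = c("Older", "After2005"))
table(release)   # Older: 2, After2005: 2
```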

movies_1 |> group_by(release) |> reframe(lovedist(ratings100K)) |>
  kable(digits = 1)
release     n miss mean  sd med mad min q25 q75 max
Older     130    0  6.3 6.2 3.7 3.8 0.1 1.8 9.3  31
After2005 130    0  5.5 5.4 4.0 4.4 0.0 1.4 7.6  31

We see we have a balanced design here, with 130 movies in each release category. Let’s draw a picture to compare these independent samples.

ggplot(movies_1, aes(x = release, y = ratings100K)) +
  geom_violin(aes(fill = release)) +
  geom_boxplot(width = 0.2, notch = TRUE) + 
  stat_summary(fun = mean, geom = "point", size = 3, col = "red") +
  scale_fill_oi() +
  guides(fill = "none") +
  labs(title = "Favorite Movies (2020-2025)",
       x = "Release Category", y = "Hundreds of Thousands of IMDB User Ratings") +
  coord_flip()

Each distribution looks strongly right-skewed. So we have several available options for our analysis.

4.1 Transform the outcome

In an attempt to reduce the skew, I’d be interested in transforming the original count of user ratings to establish a new outcome. Let’s try that picture with a log transformation.

4.1.1 Logarithmic transformation?

ggplot(movies_1, aes(x = release, y = log(imdb_ratings))) +
  geom_violin(aes(fill = release)) +
  geom_boxplot(width = 0.2, notch = TRUE) + 
  stat_summary(fun = mean, geom = "point", size = 3, col = "red") +
  scale_fill_oi() +
  guides(fill = "none") +
  labs(title = "Favorite Movies (2020-2025)",
       x = "Release Category", y = "Logarithm of the # of IMDB User Ratings") +
  coord_flip()

Hmmmm… this looks a little disappointing. Have we transformed too much? Should we instead consider the square root of the number of IMDB ratings?

4.1.2 Square root transformation?

ggplot(movies_1, aes(x = release, y = sqrt(imdb_ratings))) +
  geom_violin(aes(fill = release)) +
  geom_boxplot(width = 0.2, notch = TRUE) + 
  stat_summary(fun = mean, geom = "point", size = 3, col = "red") +
  scale_fill_oi() +
  guides(fill = "none") +
  labs(title = "Favorite Movies (2020-2025)",
       x = "Release Category", y = "Square Root of # of IMDB User Ratings") +
  coord_flip()

Well, that seems much closer to a pair of distributions that could be reasonably well approximated by the Normal distribution.
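One way to quantify what the plots suggest is to compare the skewness of the raw, log, and square-root scales. Here is a base R sketch using moment skewness on simulated right-skewed counts (stand-in data, not the real ratings):

```r
## Moment skewness on raw, log, and sqrt scales (simulated skewed data);
## values near 0 suggest better symmetry.
set.seed(43101)
x <- rexp(260, rate = 1/500000)    # stand-in for imdb_ratings
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
c(raw = skew(x), log = skew(log(x)), sqrt = skew(sqrt(x)))
```

On data like these, the log scale overshoots into left skew while the square root lands closest to symmetry, matching what the violin plots show.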

4.2 Pooled t-test / linear model on sqrt(imdb_ratings)

So, let’s go ahead and run a linear model to obtain an appropriate two-sample pooled t test comparing the means of sqrt(ratings) across the two release groups.

fit1 <- lm(sqrt(imdb_ratings) ~ release, data = movies_1)

model_parameters(fit1, ci = 0.95, pretty_names = FALSE) 
Parameter        | Coefficient |    SE |            95% CI | t(258) |      p
----------------------------------------------------------------------------
(Intercept)      |      699.77 | 31.48 | [ 637.78, 761.76] |  22.23 | < .001
releaseAfter2005 |      -45.11 | 44.52 | [-132.77,  42.55] |  -1.01 | 0.312 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.

For a movie released after 2005, our model fit1 estimates that the square root of its number of IMDB user ratings will be, on average, 45.11 smaller than for a movie released in 1940-2005. There’s no concern here about using a pooled t approach (as opposed to the Welch t approach) given the balanced design.

Suppose we assume that the population of interest is “all favorite movies for students at CWRU in any class between 2020 and 2025” and we’re willing to believe that our sample of 260 movies is a representative (though not random) sample from that population. When we generalize beyond the movies selected here to the population of interest, and we assume that our model fit1 is correct, then our sample data are compatible (at the 95% confidence level) with differences in the square root of the number of IMDB user ratings ranging from 132.77 smaller to 42.55 larger in the “after 2005” group.

The frustrating point here is that we’re now talking about the square root of the number of ratings, rather than the ratings themselves. Hence, we might want to make some average predictions for each group, back on our original scale.

estimate_means(fit1, by = "release", ci = 0.95, transform = TRUE)
Estimated Marginal Means

release   |     Mean |               95% CI |  df
-------------------------------------------------
Older     | 4.90e+05 | [4.07e+05, 5.80e+05] | 258
After2005 | 4.29e+05 | [3.51e+05, 5.14e+05] | 258

Variable predicted: imdb_ratings
Predictors modulated: release

So that’s a point estimate using this model, back-transformed out of the square root transformation (by squaring the estimated mean), which yields these estimates:

  • 490,000 user ratings on average in the “Older” group with 95% CI (407,000, 580,000)
  • and 429,000 user ratings on average in the “After 2005” group with 95% CI (351,000, 514,000).
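We can verify the back-transformation by hand: squaring the fit1 coefficients reproduces the estimate_means() point estimates.

```r
## Squaring the fit1 coefficients recovers the marginal mean estimates
## (coefficient values copied from the model output above):
b0 <- 699.77        # intercept: mean sqrt(ratings) in the Older group
b1 <- -45.11        # shift in mean sqrt(ratings) for After2005
round(b0^2)         # about 489,678, i.e. the 4.90e+05 Older estimate
round((b0 + b1)^2)  # about 428,580, i.e. the 4.29e+05 After2005 estimate
```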

4.3 Bayesian linear model?

We could have run a Bayesian model instead here, although it has the same issues with assumptions and the transformation as our fit1 model.

set.seed(43101)
fit1b <- stan_glm(sqrt(imdb_ratings) ~ release, 
                  data = movies_1, refresh = 0 )

model_parameters(fit1b, ci = 0.95, pretty_names = FALSE)
Parameter        | Median |            95% CI |     pd |  Rhat |  ESS |                     Prior
-------------------------------------------------------------------------------------------------
(Intercept)      | 699.21 | [ 637.85, 762.95] |   100% | 1.000 | 4014 | Normal (677.21 +- 897.31)
releaseAfter2005 | -44.41 | [-133.16,  42.54] | 83.50% | 1.000 | 4015 |  Normal (0.00 +- 1791.17)

Uncertainty intervals (equal-tailed) computed using a MCMC distribution
  approximation.
estimate_means(fit1b, by = "release", ci = 0.95, transform = TRUE)
Estimated Marginal Means

release   |   Median |               95% CI |   pd |          ROPE | % in ROPE
------------------------------------------------------------------------------
Older     | 4.89e+05 | [4.07e+05, 5.82e+05] | 100% | [-0.10, 0.10] |        0%
After2005 | 4.29e+05 | [3.52e+05, 5.17e+05] | 100% | [-0.10, 0.10] |        0%

Variable predicted: imdb_ratings
Predictors modulated: release

Does this look like a meaningfully different set of results than we obtained with fit1? No, nor should it, given our choice of a weakly informative prior.

4.4 Use the bootstrap

Given the skew in the original data, we might consider applying the bootstrap if we still wanted to compare means. We’ll pool the variance estimates here with var.equal = TRUE, since we have a balanced design.

fit2 <- boot.t.test(imdb_ratings ~ release, data = movies_1, 
                    var.equal = TRUE, conf.level = 0.95)

fit2

    Bootstrap Two Sample t-test

data:  imdb_ratings by release
number of bootstrap samples:  9999
bootstrap p-value = 0.2732 
bootstrap difference of means (SE) = 80545.43 (72265.29) 
95 percent bootstrap percentile confidence interval:
 -60801.46 223390.85

Results without bootstrap:
t = 1.1079, df = 258, p-value = 0.2689
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -62486.79 223257.56
sample estimates:
    mean in group Older mean in group After2005 
               627142.3                546756.9 

For a movie released in 1940-2005 (the “older” period), our bootstrap fit2 estimates that this movie’s number of IMDB user ratings will be, on average, 80,385 larger (the difference in sample means: 627,142 - 546,757) than for a movie released after 2005.

Suppose we assume again that the population of interest is “all favorite movies for students at CWRU in any class between 2020 and 2025” and we’re willing to believe that our sample of 260 movies is a representative (though not random) sample from that population.

When we generalize beyond the movies selected here to the population of interest, and we assume that our model fit2 is correct, then our sample data are compatible (at the 95% confidence level) with differences in the mean of IMDB user ratings between 60,801 smaller in the “older” group up to 223,391 larger in the “older” group.

This is appealing because it’s still an estimate about the mean of IMDB user ratings, but it’s debatable that the skew in our data lets us interpret the mean as the “center” of the distribution. We might prefer to look at a bootstrap confidence interval comparing the medians of the two groups.

The sample median imdb_ratings within the two groups are:

movies_1 |> group_by(release) |> 
  summarize(n = n(), median(imdb_ratings), mean(imdb_ratings))
# A tibble: 2 × 4
  release       n `median(imdb_ratings)` `mean(imdb_ratings)`
  <fct>     <int>                  <dbl>                <dbl>
1 Older       130                 368000              627142.
2 After2005   130                 399500              546757.

So I’ll set this interval up to look at the positive difference (399,500 - 368,000 = 31,500) obtained from “After2005” minus “Older” when comparing medians. Note that the difference in sample means has the opposite sign from this difference in medians.

set.seed(431)

sample_diff_in_meds <- movies_1 |>
  specify(imdb_ratings ~ release) |>
  calculate(stat = "diff in medians", 
            order = c("After2005", "Older") )

fit3 <- movies_1 |>
  specify(imdb_ratings ~ release) |>
  generate(reps = 2000, type = "bootstrap") |>
  calculate(stat = "diff in medians", 
            order = c("After2005", "Older") ) |>
  get_ci(level = 0.95, type = "percentile") 

fit3 <- fit3 |> 
  mutate(point_est = sample_diff_in_meds) |>
  relocate(point_est)

fit3
# A tibble: 1 × 3
  point_est$stat lower_ci upper_ci
           <dbl>    <dbl>    <dbl>
1          31500  -111025  187012.

Across movies released after 2005, our bootstrap fit3 estimates that the median number of IMDB user ratings will be 31,500 larger than the median for movies released between 1940 and 2005 (the “older” period).

Suppose we assume once again that the population of interest is “all favorite movies for students at CWRU in any class between 2020 and 2025” and we’re willing to believe that our sample of 260 movies is a representative (though not random) sample from that population.

When we generalize beyond the movies selected here to the population of interest, and we assume that our model fit3 is correct, then our sample data are compatible (at the 95% confidence level) with differences in the median of IMDB user ratings between 111,025 smaller in the “After2005” group up to 187,012 larger in the “After2005” group.
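As a reminder of what `get_ci(type = "percentile")` is doing, a percentile interval is just the 2.5th and 97.5th percentiles of the bootstrap statistics. Here is a self-contained toy version (the data values are invented for illustration, not taken from our movies):

```r
# Toy sketch (made-up values): a 95% percentile bootstrap interval is just
# the 2.5th and 97.5th percentiles of the resampled statistics.
set.seed(431)
toy <- c(1, 3, 5, 7, 9)
boot_meds <- replicate(2000, median(sample(toy, replace = TRUE)))
quantile(boot_meds, c(0.025, 0.975))
```

The infer pipeline above does the same thing, with resampling stratified appropriately for the difference in medians.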

If we needed it, we could estimate a p value here with:

movies_1 |>
  specify(imdb_ratings ~ release) |>
  generate(reps = 2000, type = "bootstrap") |>
  calculate(stat = "diff in medians", 
            order = c("After2005", "Older") ) |>
  get_pvalue(obs_stat = sample_diff_in_meds, direction = "both") 
# A tibble: 1 × 1
  p_value
    <dbl>
1   0.985

4.5 Use a non-parametric Wilcoxon rank sum test

Yet another approach we might consider here is a Wilcoxon rank sum test. This test compares the two groups without using either the mean or the median of the original imdb_ratings values. Instead, it first ranks the observed imdb_ratings values regardless of release group, and then compares the centers (locations) of the two groups’ rank distributions.

wilcox.test(imdb_ratings ~ release, data = movies_1, 
            exact = FALSE, conf.int = TRUE, conf.level = 0.95)

    Wilcoxon rank sum test with continuity correction

data:  imdb_ratings by release
W = 8927.5, p-value = 0.4314
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 -53000 123000
sample estimates:
difference in location 
                 31000 
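To see what “ranking regardless of group” means, here is a tiny example (values invented for illustration) that computes the rank sum statistic by hand and matches what wilcox.test() reports:

```r
# Tiny made-up example: rank all values together, then W for group x is
# the sum of x's ranks minus n_x * (n_x + 1) / 2.
x <- c(10, 40, 25)            # group 1 values (invented)
y <- c(15, 30)                # group 2 values (invented)
r <- rank(c(x, y))            # ranks ignore group membership
W_by_hand <- sum(r[1:3]) - 3 * (3 + 1) / 2
W_by_hand                     # 3
wilcox.test(x, y)$statistic   # W = 3, matching
```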

Of course, these results describe neither the difference in means nor the difference in medians of the release groups on our outcome (imdb_ratings), so they’re meaningfully harder to interpret, unless you’re only concerned with the p value, which is also pretty large, as we saw with the other methods shown above.

The table below shows point estimates, 95% confidence intervals and p values for the After2005 - Older difference in the statistic being compared:

Method                                                      | Point Estimate |            95% CI | p value
------------------------------------------------------------|----------------|-------------------|--------
Compare means of square root of imdb_ratings                |         -45.11 |        (-133, 43) |   0.312
Bootstrap t test on means of imdb_ratings (equal variances) |         -80678 |  (-222142, 62460) |   0.268
Pooled t test on means of imdb_ratings                      |         -80678 |  (-223258, 62487) |   0.269
Bootstrap on medians of imdb_ratings                        |          31500 | (-111025, 187012) |   0.985
Wilcoxon rank sum test comparing locations of imdb_ratings  |          31000 |  (-53000, 123000) |   0.431

4.6 Answering Question 1

We have some conflicting results. Comparing sample means suggests that the older movies have slightly higher numbers of ratings, while comparing sample medians suggests that the movies released after 2005 have a few more ratings. In either case, though, the difference between the older and more recent movies, after modeling, appears to be small relative to the variation in our data.

5 Question 2

  • Which MPA categories have higher average IMDB star ratings? (mpa4 and imdb_stars)

We’ll restrict ourselves to our 4-category description of the mpa categories, and here’s a numerical summary of imdb_stars within each of those categories.

movies_1 |> group_by(mpa4) |> reframe(lovedist(imdb_stars)) |>
  kable(digits = 2)
mpa4 n miss mean sd med mad min q25 q75 max
PG 67 0 7.57 0.66 7.70 0.59 5.8 7.20 8.0 8.7
PG-13 84 0 7.42 0.85 7.50 0.89 4.6 6.77 8.0 9.1
R 80 0 7.79 0.84 7.95 0.52 3.6 7.47 8.3 9.3
Other 29 0 7.35 1.14 7.70 0.59 3.4 7.10 8.0 8.6

So we don’t have a balanced design here. Let’s draw a comparison boxplot.

ggplot(movies_1, aes(x = mpa4, y = imdb_stars)) +
  geom_violin(aes(fill = mpa4)) +
  geom_boxplot(width = 0.2, notch = TRUE) + 
  stat_summary(fun = mean, geom = "point", size = 3, col = "red") +
  scale_fill_viridis_d(alpha = 0.4) +
  guides(fill = "none") +
  labs(title = "Favorite Movies (2020-2025)",
       x = "MPA Category", y = "IMDB Star Rating") 

5.1 Analysis of Variance model

Let’s run an ANOVA model to compare the mean star ratings in each of these four mpa4 categories. Notice that the anova() and aov() functions provide some of the same information, just arranged differently.

fit4 <- lm(imdb_stars ~ mpa4, data = movies_1)

anova(fit4)
Analysis of Variance Table

Response: imdb_stars
           Df Sum Sq Mean Sq F value  Pr(>F)  
mpa4        3   7.16 2.38674   3.365 0.01925 *
Residuals 256 181.57 0.70927                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov(fit4)
Call:
   aov(formula = fit4)

Terms:
                     mpa4 Residuals
Sum of Squares    7.16023 181.57424
Deg. of Freedom         3       256

Residual standard error: 0.8421843
Estimated effects may be unbalanced
eta_squared(fit4)
For one-way between subjects designs, partial eta squared is equivalent
  to eta squared. Returning eta squared.
# Effect Size for ANOVA

Parameter | Eta2 |       95% CI
-------------------------------
mpa4      | 0.04 | [0.00, 1.00]

- One-sided CIs: upper bound fixed at [1.00].

Model fit4 estimates that the proportion of variation in imdb_stars explained by the mpa4 rating category is only about 4%, which doesn’t sound like much, even though the ANOVA F test has a relatively small p value.
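As a check, \(\eta^2\) is just the between-groups sum of squares divided by the total sum of squares, using the values in the ANOVA table above:

```r
# eta-squared by hand from the ANOVA table: SS(mpa4) / SS(total)
ss_mpa4  <- 7.16023
ss_resid <- 181.57424
ss_mpa4 / (ss_mpa4 + ss_resid)   # ~ 0.038, the ~4% reported above
```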

5.1.1 Model Equation yields the Sample Means

In light of these results, let’s dig a little more deeply into this ANOVA model.

model_parameters(fit4, ci = 0.95, pretty_names = FALSE)
Parameter   | Coefficient |   SE |        95% CI | t(256) |      p
------------------------------------------------------------------
(Intercept) |        7.57 | 0.10 | [ 7.37, 7.77] |  73.58 | < .001
mpa4PG-13   |       -0.15 | 0.14 | [-0.42, 0.12] |  -1.10 | 0.274 
mpa4R       |        0.22 | 0.14 | [-0.05, 0.49] |   1.58 | 0.116 
mpa4Other   |       -0.22 | 0.19 | [-0.59, 0.15] |  -1.17 | 0.244 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
estimate_means(fit4, ci = 0.95, by = "mpa4")
Estimated Marginal Means

mpa4  | Mean |   SE |       95% CI | t(256)
-------------------------------------------
PG    | 7.57 | 0.10 | [7.37, 7.77] |  73.58
PG-13 | 7.42 | 0.09 | [7.24, 7.60] |  80.74
R     | 7.79 | 0.09 | [7.60, 7.98] |  82.73
Other | 7.35 | 0.16 | [7.04, 7.66] |  47.01

Variable predicted: imdb_stars
Predictors modulated: mpa4

We see that the model reproduces the sample means of the four groups, as it should, and that R has the highest sample mean while Other has the lowest across these four groups.
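The arithmetic behind that: with indicator (dummy) coding, the intercept is the sample mean for the reference level (PG), and each coefficient is that group’s difference from PG. Using the rounded coefficients from the output above:

```r
# Rounded coefficients copied from the model_parameters() output above:
b0 <- 7.57                  # (Intercept): mean for reference group PG
c(PG    = b0,
  PG13  = b0 + (-0.15),     # 7.42
  R     = b0 + 0.22,        # 7.79
  Other = b0 + (-0.22))     # 7.35
```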

5.1.2 Contrasts

We can run contrasts like this to compare all pairs of means, if we’re not concerned about the multiple comparisons problem.

estimate_contrasts(fit4, ci = 0.95, contrast = "mpa4")
Marginal Contrasts Analysis

Level1 | Level2 | Difference |   SE |         95% CI | t(256) |     p
---------------------------------------------------------------------
PG-13  | PG     |      -0.15 | 0.14 | [-0.42,  0.12] |  -1.10 | 0.274
R      | PG     |       0.22 | 0.14 | [-0.05,  0.49] |   1.58 | 0.116
Other  | PG     |      -0.22 | 0.19 | [-0.59,  0.15] |  -1.17 | 0.244
R      | PG-13  |       0.37 | 0.13 | [ 0.11,  0.63] |   2.82 | 0.005
Other  | PG-13  |      -0.07 | 0.18 | [-0.42,  0.29] |  -0.37 | 0.711
Other  | R      |      -0.44 | 0.18 | [-0.80, -0.08] |  -2.40 | 0.017

Variable predicted: imdb_stars
Predictors contrasted: mpa4
p-values are uncorrected.

5.1.3 Model Performance

model_performance(fit4)
# Indices of model performance

AIC   |  AICc |   BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------
654.5 | 654.7 | 672.3 | 0.038 |     0.027 | 0.836 | 0.842

As mentioned previously, the \(\eta^2\) (“eta-squared” or, more generally, R-squared) in this ANOVA model is only 3.8%, and we can see that the \(\sigma\) value is 0.842, implying that about 68% of this model’s predicted imdb_stars results should be within 0.842 of the actual imdb_stars value, and about 95% of the predictions should be within (0.842 x 2) = 1.684 stars. That also doesn’t sound great, in light of the fact that 75% of the movies have star ratings between 7.1 and 9.3, for example.
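The 68% and 95% figures come from the empirical rule for Normally distributed errors; a quick simulation (simulated Normal draws, not our movie data) using the model’s \(\sigma\) of 0.842 illustrates:

```r
# Simulated Normal errors with sd = sigma: about 68% fall within one
# sigma of 0, and about 95% within two sigmas.
set.seed(431)
z <- rnorm(100000, mean = 0, sd = 0.842)
mean(abs(z) <= 0.842)       # ~ 0.68
mean(abs(z) <= 2 * 0.842)   # ~ 0.95
```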

movies_1 |> reframe(lovedist(imdb_stars)) |> kable(digits = 2)
n miss mean sd med mad min q25 q75 max
260 0 7.56 0.85 7.7 0.74 3.4 7.1 8.1 9.3

5.2 Post-Hoc Bonferroni comparisons

Here are the family-wise 95% Bonferroni confidence intervals for each pairwise difference of means.

PostHocTest(aov(fit4), method = "bonferroni", conf.level = 0.95)

  Posthoc multiple comparisons of means : Bonferroni 
    95% family-wise confidence level

$mpa4
                   diff     lwr.ci     upr.ci   pval    
PG-13-PG    -0.15110163 -0.5178966 0.21569331 1.0000    
R-PG         0.21985075 -0.1509906 0.59069212 0.6971    
Other-PG    -0.21842512 -0.7161750 0.27932472 1.0000    
R-PG-13      0.37095238  0.0211287 0.72077606 0.0311 *  
Other-PG-13 -0.06732348 -0.5496182 0.41497122 1.0000    
Other-R     -0.43827586 -0.9236551 0.04710335 0.1024    

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
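The “family-wise” part works by splitting the error rate across all of the comparisons: with four groups there are choose(4, 2) = 6 pairwise comparisons, so each individual interval is built at a stricter confidence level.

```r
# Bonferroni: divide alpha across all pairwise comparisons.
k <- choose(4, 2)        # 6 pairwise comparisons among 4 mpa4 groups
alpha_each <- 0.05 / k   # ~ 0.0083 per comparison
1 - alpha_each           # each interval uses ~ 99.17% confidence
```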

Here’s a plot of those confidence intervals.

par(mar=c(3,6,3,3), las = 1)
plot(PostHocTest(aov(fit4), method = "bonferroni", conf.level = 0.95))

It appears that the comparison of R vs. PG-13 is the only one with a confidence interval here that doesn’t include 0.

5.3 Post-Hoc Tukey HSD comparisons

A Tukey HSD approach is hard to justify here, since the design isn’t even close to being balanced. We can still run it, but I would likely use the Bonferroni intervals here.

PostHocTest(aov(fit4), method = "hsd", conf.level = 0.95)

  Posthoc multiple comparisons of means : Tukey HSD 
    95% family-wise confidence level

$mpa4
                   diff     lwr.ci     upr.ci   pval    
PG-13-PG    -0.15110163 -0.5078357 0.20563240 0.6927    
R-PG         0.21985075 -0.1408187 0.58052022 0.3940    
Other-PG    -0.21842512 -0.7025221 0.26567182 0.6483    
R-PG-13      0.37095238  0.0307241 0.71118066 0.0265 *  
Other-PG-13 -0.06732348 -0.5363892 0.40174224 0.9825    
Other-R     -0.43827586 -0.9103415 0.03378976 0.0795 .  

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mar=c(3,6,3,3), las = 1)
plot(PostHocTest(aov(fit4), method = "hsd", conf.level = 0.95))

Based on the Tukey HSD approach, it again appears that the comparison of R vs. PG-13 is the only one with a confidence interval here that doesn’t include 0.

5.4 Assessing ANOVA assumptions

The ANOVA assumptions are the same as those for any linear model, so we can run the complete set of diagnostic plots and see what’s happening.

check_model(fit4, detrend = FALSE)

  • The posterior predictive check isn’t terrible, but our model is predicting fewer results in the 8 to 8.5 star range than we see in the data, and more results above 9.0.
  • The linearity plot shows no issues with this one-factor ANOVA.
  • The homogeneity of variance plot suggests a potential issue with the group having fitted values (remember this is just a sample mean) below 7.4 (so that’s the “Other” group) showing a bit more variation than the other groups. We saw this earlier in the sample standard deviations in each group (and remember that “Other” also has many fewer movies than the other three groups.)
  • There are no highly influential observations affecting this ANOVA model.
  • The Normality of the ANOVA residuals looks like our biggest problem, with some left skew, especially shown by the dip below the line on the left of the plot.

On the whole, though, I’d be inclined to stick with this model for now. The fundamental conclusion here is that the mpa4 groups, by themselves, do a poor job of predicting imdb_stars.

5.5 Bayesian model?

We could have run a Bayesian model instead here, although it has the same issues with assumptions as our fit4 model.

set.seed(43104)
fit4b <- stan_glm(imdb_stars ~ mpa4, 
                  data = movies_1, refresh = 0)

model_parameters(fit4b, ci = 0.95, pretty_names = FALSE)
Parameter   | Median |        95% CI |     pd |  Rhat |  ESS |                 Prior
------------------------------------------------------------------------------------
(Intercept) |   7.57 | [ 7.37, 7.77] |   100% | 1.001 | 2727 | Normal (7.56 +- 2.13)
mpa4PG-13   |  -0.15 | [-0.42, 0.11] | 87.02% | 1.000 | 3086 | Normal (0.00 +- 4.55)
mpa4R       |   0.22 | [-0.06, 0.50] | 94.62% | 1.002 | 2980 | Normal (0.00 +- 4.61)
mpa4Other   |  -0.22 | [-0.59, 0.14] | 87.33% | 1.000 | 3208 | Normal (0.00 +- 6.77)

Uncertainty intervals (equal-tailed) computed using a MCMC distribution
  approximation.

Does this look like a meaningfully different set of results than we obtained with fit4? No, and we don’t (yet) have an easy way to even get an ANOVA table in this case.

5.6 Answering Question 2

For our sample, it looks like the R-rated movies have slightly higher mean imdb_stars than the PG-13 movies, based on the 95% Bonferroni comparisons, although the model is not strong at all.

6 Question 3

How strong is the association between the number of IMDB user ratings, expressed in hundreds of thousands (ratings100K), and the weighted average star rating (imdb_stars)?

dat2 <- bind_rows(
  movies_1 |> reframe(lovedist(imdb_stars)),
  movies_1 |> reframe(lovedist(ratings100K))
)

dat2 <- dat2 |> 
  mutate(var_name = c("imdb_stars", "ratings100K")) |>
  relocate(var_name)

kable(dat2, digits = 1) 
var_name n miss mean sd med mad min q25 q75 max
imdb_stars 260 0 7.6 0.9 7.7 0.7 3.4 7.1 8.1 9.3
ratings100K 260 0 5.9 5.9 3.7 4.0 0.0 1.7 8.4 31.0

Let’s look at imdb_stars (on the y axis) and ratings100K (hundreds of thousands of IMDB user ratings) on the x axis.

ggplot(movies_1, aes(x = ratings100K, y = imdb_stars)) +
  geom_point() +
  geom_smooth(method = "loess", 
              formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", 
              formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "Hundreds of Thousands of IMDB ratings",
       y = "Weighted average star rating")

The linear model (in red) looks like it might work pretty well, once we get between 500,000 and 2,000,000 IMDB ratings, but isn’t doing as well at the low and high ends of that variable.

6.1 Pearson correlation

The Pearson correlation coefficient might also help.

movies_1 |> select(ratings100K, imdb_stars) |> correlation()
# Correlation Matrix (pearson-method)

Parameter1  | Parameter2 |    r |       95% CI | t(258) |         p
-------------------------------------------------------------------
ratings100K | imdb_stars | 0.62 | [0.54, 0.69] |  12.61 | < .001***

p-value adjustment method: Holm (1979)
Observations: 260

It looks like the Pearson correlation is 0.62 in our sample.

6.2 Simple Regression Model and its Equation

The reason I am focusing on ratings100K instead of imdb_ratings is as follows:

fit5 <- lm(imdb_stars ~ imdb_ratings, data = movies_1)

model_parameters(fit5, ci = 0.95, pretty_names = FALSE)
Parameter    | Coefficient |       SE |       95% CI | t(258) |      p
----------------------------------------------------------------------
(Intercept)  |        7.04 |     0.06 | [6.92, 7.15] | 118.94 | < .001
imdb_ratings |    9.01e-07 | 7.14e-08 | [0.00, 0.00] |  12.61 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.

It is very hard to conceptualize \(9.01 \times 10^{−7}\) in any practical context, plus the 95% CI for the slope of imdb_ratings displays as [0.00, 0.00] after rounding. That’s not helping me understand what’s going on.

fit6 <- lm(imdb_stars ~ ratings100K, data = movies_1)

model_parameters(fit6, ci = 0.95, pretty_names = FALSE)
Parameter   | Coefficient |       SE |       95% CI | t(258) |      p
---------------------------------------------------------------------
(Intercept) |        7.04 |     0.06 | [6.92, 7.15] | 118.94 | < .001
ratings100K |        0.09 | 7.14e-03 | [0.08, 0.10] |  12.61 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.

Actually, I’d like to get another decimal place here.

model_parameters(fit6, ci = 0.95, pretty_names = FALSE, digits = 3)
Parameter   | Coefficient |    SE |         95% CI |  t(258) |      p
---------------------------------------------------------------------
(Intercept) |       7.036 | 0.059 | [6.919, 7.152] | 118.943 | < .001
ratings100K |       0.090 | 0.007 | [0.076, 0.104] |  12.607 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.

This is a little more interpretable. If we have two movies, one of which has 100,000 more IMDB user ratings than the other, then this model fit6 estimates that the movie with more ratings will have a weighted average star rating (imdb_stars) that is 0.09 higher than the movie with fewer ratings.
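The two slopes tell the same story on different scales. Assuming ratings100K was created as imdb_ratings / 100000, multiplying the fit5 slope by 100,000 recovers the fit6 slope:

```r
# Rescaling check: the slope per single rating (fit5) times 100,000
# should equal the slope per 100,000 ratings (fit6), up to rounding.
9.01e-07 * 100000   # 0.0901, matching the fit6 coefficient of 0.090
```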

Suppose again we assume that the population of interest is “all favorite movies for students at CWRU in any class between 2020 and 2025” and we’re willing to believe that our sample of 260 movies is a representative (though not random) sample from that population.

When we generalize beyond the movies selected here to this population of interest, then our sample data are compatible (at the 95% confidence level) with slopes for the relationship between imdb_stars and ratings100K ranging from 0.076 to 0.104, assuming our fit6 model is correct.

6.3 Model Performance

Let’s look at the model’s performance.

model_performance(fit6)
# Indices of model performance

AIC   |  AICc |   BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------
535.8 | 535.9 | 546.4 | 0.381 |     0.379 | 0.670 | 0.673
  • The \(R^2\) tells us that the fit6 model accounts for 38.1% of the variation in imdb_stars using ratings100K as a predictor.

  • We also see \(\sigma = 0.673\). This suggests that about 68% of our predictions of imdb_stars will be within \(\pm 0.673\) of the observed imdb_stars, and that about 95% will be within \(\pm 1.346\). This is better than we did with our ANOVA model in Question 2 (looking at mpa4) but still not terrific, given that 75% of our movies fall between 7.1 and 9.3 stars.

6.4 Assumption Checking

Let’s check our assumptions with diagnostic plots.

check_model(fit6, detrend = FALSE)

  • The posterior predictive check is a bit off. This fit6 model is predicting fewer results in the 7.5 to 8.5 star range than we see in the data, and more results below 7.5 and above 8.75, it seems.
  • The linearity plot shows a bit of a curve in the reference line.
  • The homogeneity of variance plot suggests a fairly substantial issue as well, as the line is curved, rather than flat and horizontal.
  • There are no highly influential observations affecting this regression model.
  • The Normality of the residuals also indicates some fairly substantial left skew.

6.5 Bayesian linear model?

Do we obtain meaningfully different results from a Bayesian version of this model, with a (default) weakly informative prior?

fit6b <- stan_glm(imdb_stars ~ ratings100K, 
                  data = movies_1, refresh = 0)

model_parameters(fit6b, ci = 0.95, pretty_names = FALSE, digits = 3)
Parameter   | Median |         95% CI |   pd |  Rhat |  ESS |                 Prior
-----------------------------------------------------------------------------------
(Intercept) |  7.036 | [6.922, 7.151] | 100% | 1.000 | 4326 | Normal (7.56 +- 2.13)
ratings100K |  0.090 | [0.076, 0.104] | 100% | 0.999 | 3987 | Normal (0.00 +- 0.36)

Uncertainty intervals (equal-tailed) computed using a MCMC distribution
  approximation.
model_performance(fit6b)
# Indices of model performance

ELPD     | ELPD_SE | LOOIC | LOOIC_SE |  WAIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------------------------
-269.370 |  24.030 | 538.7 |   48.060 | 538.7 | 0.380 |     0.376 | 0.670 | 0.675
check_model(fit6b, detrend = FALSE)

The fit6 and fit6b models are pretty similar in terms of estimated parameters, performance metrics and diagnostic checks. I’m going to conclude that the Bayesian model here looks pretty much like our ordinary least squares model, so I’ll move on.

6.6 Considering a Transformation

Could a transformation of our outcome be helpful here?

boxCox(fit6)

The Box-Cox plot is a little cut off on the right side. Can we fix that?

boxCox(fit6, lambda = seq(-2, 10, 1/5))

This seems to suggest that we take the imdb_stars value to about the 5th power. That seems ridiculous - in practical work, I would never use a power higher than 3 on an outcome.
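For reference, the Box-Cox family rescales the outcome as \((y^\lambda - 1)/\lambda\), with \(\log y\) used when \(\lambda = 0\). A minimal sketch of that transformation (the function name is ours, for illustration only):

```r
# The Box-Cox transformation for a chosen lambda (log when lambda = 0).
bc_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
bc_transform(7.7, 3)   # cube-scale version of a 7.7-star rating
bc_transform(7.7, 0)   # log version
```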

Suppose we instead just cube the imdb_stars. How’s that scatterplot look?

ggplot(movies_1, aes(x = ratings100K, y = (imdb_stars^3))) +
  geom_point() +
  geom_smooth(method = "loess", 
              formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", 
              formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "Hundreds of Thousands of IMDB ratings",
       y = "Cube of Weighted average star rating")

It’s not obvious to me that this is going to be a meaningful improvement, but we can check it.

6.7 New model for transformed outcome

Here’s the new model:

fit7 <- lm(imdb_stars^3 ~ ratings100K, data = movies_1)

model_parameters(fit7)
Parameter   | Coefficient |   SE |           95% CI | t(258) |      p
---------------------------------------------------------------------
(Intercept) |      356.20 | 8.27 | [339.92, 372.48] |  43.09 | < .001
ratings100K |       15.73 | 1.00 | [ 13.76,  17.69] |  15.75 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.

We might consider summaries of its performance (remember, though, that we changed the outcome to the cube of imdb_stars, so these results, like the estimated coefficients, \(R^2\) and \(\sigma\), are NOT comparable to the values we got for model fit6).

model_performance(fit7)
# Indices of model performance

AIC   |  AICc |   BIC |    R2 | R2 (adj.) |   RMSE |  Sigma
-----------------------------------------------------------
436.4 | 436.5 | 447.1 | 0.490 |     0.488 | 93.650 | 94.012

Perhaps the most important thing to do here is see if the transformation meaningfully improved our adherence to the assumptions of a linear model. Let’s focus on that.

check_model(fit7, detrend = FALSE)

If anything, this looks meaningfully worse, in terms of the posterior predictive check, so it doesn’t look like a simple transformation of the outcome is going to fix our problem.

6.8 What if we transformed both Y and X?

See Section 11.6 of our Course Book for another example of a log-log model like this, and its interpretation.

ggplot(movies_1, aes(x = log(ratings100K), y = log(imdb_stars))) +
  geom_point() +
  geom_smooth(method = "loess", 
              formula = y ~ x, se = FALSE, col = "blue") +
  geom_smooth(method = "lm", 
              formula = y ~ x, se = TRUE, col = "red") +
  labs(x = "Logarithm of Hundreds of Thousands of IMDB ratings",
       y = "Logarithm of Weighted average star rating")

Does this log-log pair of transformations fare any better, in terms of assumptions?

fit8 <- lm(log(imdb_stars) ~ log(ratings100K), data = movies_1)
model_parameters(fit8, ci = 0.95, pretty_names = FALSE)
Parameter        | Coefficient |       SE |       95% CI | t(258) |      p
--------------------------------------------------------------------------
(Intercept)      |        1.95 | 8.59e-03 | [1.93, 1.96] | 226.50 | < .001
log(ratings100K) |        0.06 | 4.92e-03 | [0.05, 0.07] |  12.21 | < .001

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.

The model has a log-transformed response variable. Consider using
  `exponentiate = TRUE` to interpret coefficients as ratios.
model_performance(fit8)
# Indices of model performance

AIC   |  AICc |   BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------
603.0 | 603.1 | 613.6 | 0.366 |     0.364 | 0.102 | 0.102
check_model(fit8, detrend = FALSE)

Nope. This is worse than what we started with.
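Even though we won’t use fit8, its slope does have a tidy reading: in a log-log model, multiplying the predictor by some factor multiplies the predicted outcome by that factor raised to the slope. With the 0.06 slope estimated above:

```r
# Elasticity reading of the log-log slope: doubling ratings100K
# multiplies the predicted imdb_stars by 2^0.06.
2^0.06   # ~ 1.042, i.e., roughly a 4.2% higher star rating
```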

6.9 Answering Question 3

Our regression model fit6 without any transformations explained about 38.1% of the variation in imdb_stars using ratings100K as a predictor. That looks to be our best option right now to answer our question 3.

7 Question 4

  • Which movie genres have the highest weighted average star ratings on IMDB?

7.1 Creating a New Data Set

Here we build one summary row of imdb_stars statistics for each genre.

genre_cols <- c("action", "adventure", "animation", "biography", "comedy",
                "crime", "drama", "family", "fantasy", "history", "horror",
                "mystery", "music", "musical", "romance", "scifi", "sport",
                "thriller", "war", "western")

mov1_summary <- movies_1 |>
  pivot_longer(all_of(genre_cols), 
               names_to = "genre", values_to = "in_genre") |>
  filter(in_genre == 1) |>
  group_by(genre) |>
  summarise(n = n(), 
            mean_stars = mean(imdb_stars), 
            sd_stars = sd(imdb_stars),
            median_stars = median(imdb_stars),
            mad_stars = mad(imdb_stars),
            q25_stars = quantile(imdb_stars, 0.25),
            q75_stars = quantile(imdb_stars, 0.75),
            .groups = "drop")
mov1_summary |> arrange(desc(mean_stars)) |>
  kable()
genre n mean_stars sd_stars median_stars mad_stars q25_stars q75_stars
war 11 8.063636 0.4500505 8.10 0.59304 7.800 8.450
biography 17 7.929412 0.4194710 8.00 0.29652 7.800 8.200
history 6 7.916667 0.4355074 8.05 0.29652 7.850 8.175
western 2 7.850000 0.0707107 7.85 0.07413 7.825 7.875
crime 34 7.720588 0.9741444 7.85 0.81543 7.225 8.300
animation 29 7.703448 0.5784726 7.70 0.44478 7.600 8.000
drama 153 7.697386 0.8402653 7.80 0.59304 7.300 8.200
thriller 50 7.622000 0.7089198 7.70 0.74130 7.200 8.175
adventure 87 7.614942 0.7533690 7.70 0.74130 7.200 8.200
scifi 51 7.596078 0.8175477 7.70 0.88956 7.100 8.300
mystery 26 7.576923 0.6901059 7.70 0.59304 7.225 7.900
fantasy 60 7.560000 0.8793680 7.70 0.59304 7.200 8.000
action 60 7.508333 0.8415667 7.50 1.03782 6.875 8.200
family 44 7.422727 0.7426739 7.70 0.59304 7.075 7.900
romance 62 7.406452 0.6885173 7.45 0.66717 7.025 7.900
music 30 7.393333 0.8529597 7.55 0.81543 6.850 8.000
sport 6 7.366667 0.4501851 7.50 0.37065 7.175 7.675
comedy 103 7.250485 0.8209122 7.40 0.74130 6.900 7.800
musical 17 7.205882 0.6823403 7.30 0.59304 6.800 7.700
horror 12 6.966667 1.5417424 7.15 1.55673 6.450 8.175
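
The twenty nearly identical `row01` through `row20` blocks above can be collapsed into a single pipeline by reshaping the 0/1 genre indicators to long format and summarizing once per genre. This is a sketch, not the code used above: it assumes the indicator columns in `movies_1` are contiguous, running from `action` through `western`; if they are not, list them explicitly in `cols`.

```r
## Sketch: one pipeline instead of 20 copy-pasted blocks.
## Assumes movies_1 has 0/1 indicator columns action:western
## (contiguous) plus imdb_stars.
mov1_summary_alt <- movies_1 |>
  pivot_longer(cols = action:western,
               names_to = "genre", values_to = "in_genre") |>
  filter(in_genre == 1) |>
  group_by(genre) |>
  summarise(n = n(),
            mean_stars = mean(imdb_stars),
            sd_stars = sd(imdb_stars),
            median_stars = median(imdb_stars),
            mad_stars = mad(imdb_stars),
            q25_stars = quantile(imdb_stars, 0.25),
            q75_stars = quantile(imdb_stars, 0.75),
            .groups = "drop")
```

Because each movie appears once per genre it belongs to, this reproduces the same overlapping-group counts as the `bind_rows()` approach (a drama-comedy contributes to both the `drama` and `comedy` rows).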

7.2 What about overlap?

Of course, I’ve completely ignored the issue of movies with multiple genres here.

For example, suppose we’re interested in comedies and dramas, plus the movies that are both, and those that are neither. We’d need a summary like this.

movies_1 |> group_by(comedy, drama) |>
  summarize(n = n(), 
            mean_stars = mean(imdb_stars), 
            sd_stars = sd(imdb_stars),
            median_stars = median(imdb_stars),
            mad_stars = mad(imdb_stars),
            .groups = "keep") |>
  kable(digits = 2)
comedy drama n mean_stars sd_stars median_stars mad_stars
0 0 54 7.64 0.77 7.8 0.74
0 1 103 7.84 0.83 8.0 0.59
1 0 53 7.10 0.82 7.1 0.89
1 1 50 7.41 0.79 7.6 0.59
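
The 0/1 row labels in that table take some decoding. One option (a sketch, again assuming the 0/1 indicators `comedy` and `drama` in `movies_1`) is to build an explicit label for the four combinations with `case_when()` before summarizing:

```r
## Sketch: self-describing group labels for the four
## comedy/drama combinations, instead of 0/1 codes.
movies_1 |>
  mutate(group = case_when(
    comedy == 1 & drama == 1 ~ "Comedy and Drama",
    comedy == 1 & drama == 0 ~ "Comedy only",
    comedy == 0 & drama == 1 ~ "Drama only",
    .default = "Neither")) |>
  group_by(group) |>
  summarise(n = n(),
            mean_stars = mean(imdb_stars),
            median_stars = median(imdb_stars)) |>
  kable(digits = 2)
```

The `.default` argument to `case_when()` requires dplyr 1.1 or later.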

7.3 A Cleveland dot plot

Here is a Cleveland dot plot of the summarized data on the individual genres. Cleveland dot plots are an alternative to bar graphs that reduce visual clutter and may be easier to read.

ggplot(data = mov1_summary, aes(x = mean_stars, y = genre)) +
  geom_point()

Here is a fancier version, based on the recipe in Section 3.10 of the R Graphics Cookbook.

ggplot(data = mov1_summary, 
       aes(x = mean_stars, y = reorder(genre, mean_stars))) +
  geom_point(size = 3) +
  theme(
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.major.y = element_line(colour = "grey60", linetype = "dashed")
  ) +
  labs(y = "Movie Genre", x = "Mean Star Rating on IMDB")

Here is a plot of the median, 25th, and 75th percentiles of star ratings for each genre instead.

ggplot(data = mov1_summary, 
       aes(x = median_stars, y = reorder(genre, median_stars))) +
  geom_pointrange(aes(xmin = q25_stars, xmax = q75_stars)) +
  labs(y = "Movie Genre", x = "Median (and Q25, Q75) of Star Rating on IMDB")

Note that there are only two western movies in our data, so the quartiles (and most other summaries) for that genre are based on far too little information to be meaningful.
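
One way to keep such sparse genres from misleading readers is to drop them before plotting. This is a sketch, not part of the original analysis; the cutoff of 5 movies is an arbitrary choice for illustration.

```r
## Sketch: restrict the pointrange plot to genres with at
## least 5 movies, so unstable quartile estimates (e.g., the
## two westerns) don't dominate the display.
ggplot(data = mov1_summary |> filter(n >= 5),
       aes(x = median_stars, y = reorder(genre, median_stars))) +
  geom_pointrange(aes(xmin = q25_stars, xmax = q75_stars)) +
  labs(y = "Movie Genre",
       x = "Median (and Q25, Q75) of Star Rating on IMDB",
       subtitle = "Genres with at least 5 movies")
```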

8 Session Information

session_info()
R version 4.5.1 (2025-06-13 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Locale:
  LC_COLLATE=English_United States.utf8 
  LC_CTYPE=English_United States.utf8   
  LC_MONETARY=English_United States.utf8
  LC_NUMERIC=C                          
  LC_TIME=English_United States.utf8    

Package version:
  abind_1.4-8            arrangements_1.1.9     askpass_1.2.1         
  backports_1.5.0        base_4.5.1             base64enc_0.1-3       
  bayesplot_1.14.0       bayestestR_0.17.0      BH_1.87.0.1           
  bit_4.6.0              bit64_4.6.0.1          blob_1.2.4            
  boot_1.3-32            broom_1.0.10           bslib_0.9.0           
  cachem_1.1.0           callr_3.7.6            car_3.1-3             
  carData_3.0-5          cellranger_1.1.0       checkmate_2.3.3       
  class_7.3-23           cli_3.6.5              clipr_0.8.0           
  coda_0.19-4.1          codetools_0.2-20       collapse_2.1.3        
  colourpicker_1.3.0     commonmark_2.0.0       compiler_4.5.1        
  conflicted_1.2.0       correlation_0.8.8      cowplot_1.2.0         
  cpp11_0.5.2            crayon_1.5.3           crosstalk_1.2.2       
  curl_7.0.0             data.table_1.17.8      datasets_4.5.1        
  datawizard_1.2.0       DBI_1.2.3              dbplyr_2.5.1          
  Deriv_4.2.0            desc_1.4.3             DescTools_0.99.60     
  digest_0.6.37          distributional_0.5.0   doBy_4.7.0            
  dplyr_1.1.4            DT_0.34.0              dtplyr_1.3.2          
  dygraphs_1.1.1.6       e1071_1.7-16           easystats_0.7.5       
  effectsize_1.0.1       emmeans_1.11.2-8       estimability_1.5.1    
  evaluate_1.0.5         Exact_3.3              exactRankTests_0.8-35 
  expm_1.0-0             farver_2.1.2           fastmap_1.2.0         
  fontawesome_0.5.3      forcats_1.0.1          foreach_1.5.2         
  Formula_1.2-5          fs_1.6.6               gargle_1.6.0          
  generics_0.1.4         ggplot2_4.0.0          ggrepel_0.9.6         
  ggridges_0.5.7         gld_2.6.8              glmnet_4.1-10         
  glue_1.8.0             gmp_0.7-5              googledrive_2.1.2     
  googlesheets4_1.1.2    graphics_4.5.1         grDevices_4.5.1       
  grid_4.5.1             gridExtra_2.3          gtable_0.3.6          
  gtools_3.9.5           haven_2.5.5            highr_0.11            
  hms_1.1.3              htmltools_0.5.8.1      htmlwidgets_1.6.4     
  httpuv_1.6.16          httr_1.4.7             ids_1.0.1             
  igraph_2.1.4           infer_1.0.9            inline_0.3.21         
  insight_1.4.2          isoband_0.2.7          iterators_1.0.14      
  janitor_2.2.1          jomo_2.7-6             jquerylib_0.1.4       
  jsonlite_2.0.0         kableExtra_1.4.0       knitr_1.50            
  labeling_0.4.3         later_1.4.4            lattice_0.22-7        
  lazyeval_0.2.2         lifecycle_1.0.4        litedown_0.7          
  lme4_1.1-37            lmom_3.2               loo_2.8.0             
  lubridate_1.9.4        magrittr_2.0.4         marginaleffects_0.30.0
  markdown_2.0           MASS_7.3-65            Matrix_1.7-4          
  MatrixModels_0.5.4     matrixStats_1.5.0      memoise_2.0.1         
  methods_4.5.1          mgcv_1.9-3             mice_3.18.0           
  miceadds_3.18-36       microbenchmark_1.5.0   mime_0.13             
  miniUI_0.1.2           minqa_1.2.8            mitml_0.4-5           
  mitools_2.4            MKdescr_0.9            MKinfer_1.2           
  modelbased_0.13.0      modelr_0.1.11          multcomp_1.4-28       
  mvtnorm_1.3-3          naniar_1.1.0           nlme_3.1-168          
  nloptr_2.2.1           nnet_7.3-20            norm_1.0.11.1         
  numDeriv_2016.8.1.1    openssl_2.3.4          ordinal_2023.12.4.1   
  pan_1.9                parallel_4.5.1         parameters_0.28.2     
  patchwork_1.3.2        pbkrtest_0.5.5         performance_0.15.1    
  pillar_1.11.1          pkgbuild_1.4.8         pkgconfig_2.0.3       
  plyr_1.8.9             posterior_1.6.1        prettyunits_1.2.0     
  processx_3.8.6         progress_1.2.3         promises_1.3.3        
  proxy_0.4-27           ps_1.9.1               purrr_1.1.0           
  quantreg_6.1           QuickJSR_1.8.1         R6_2.6.1              
  ragg_1.5.0             rappdirs_0.3.3         rbibutils_2.3         
  RColorBrewer_1.1-3     Rcpp_1.1.0             RcppArmadillo_15.0.2.2
  RcppEigen_0.3.4.0.2    RcppParallel_5.1.11-1  Rdpack_2.6.4          
  readr_2.1.5            readxl_1.4.5           reformulas_0.4.1      
  rematch_2.0.0          rematch2_2.1.2         report_0.6.1          
  reprex_2.1.1           reshape2_1.4.4         rlang_1.1.6           
  rmarkdown_2.30         rootSolve_1.8.2.4      rpart_4.1.24          
  rstan_2.32.7           rstanarm_2.32.2        rstantools_2.5.0      
  rstudioapi_0.17.1      rvest_1.0.5            S7_0.2.0              
  sandwich_3.1-1         sass_0.4.10            scales_1.4.0          
  see_0.12.0             selectr_0.4.2          shape_1.4.6.1         
  shiny_1.11.1           shinyjs_2.1.0          shinystan_2.6.0       
  shinythemes_1.2.0      snakecase_0.11.1       sourcetools_0.1.7.1   
  SparseM_1.84.2         splines_4.5.1          StanHeaders_2.32.10   
  stats_4.5.1            stats4_4.5.1           stringi_1.8.7         
  stringr_1.5.2          survival_3.8-3         svglite_2.2.1         
  sys_3.4.3              systemfonts_1.3.0      tensorA_0.36.2.1      
  textshaping_1.0.3      TH.data_1.1-4          threejs_0.3.4         
  tibble_3.3.0           tidyr_1.3.1            tidyselect_1.2.1      
  tidyverse_2.0.0        timechange_0.3.0       tinytex_0.57          
  tools_4.5.1            tzdb_0.5.0             ucminf_1.2.2          
  UpSetR_1.4.0           utf8_1.2.6             utils_4.5.1           
  uuid_1.2.1             V8_8.0.0               vctrs_0.6.5           
  viridis_0.6.5          viridisLite_0.4.2      visdat_0.6.0          
  vroom_1.6.6            withr_3.0.2            xfun_0.53             
  xml2_1.4.0             xtable_1.8-4           xts_0.14.1            
  yaml_2.3.10            zoo_1.8-14